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Abstract 

Branching processes are a class of continuous-time Markov chains (CTMCs) with ubiqui¬ 
tous applications. A general difficulty in statistical inference under partially observed CTMC 
models arises in computing transition probabilities when the discrete state space is large or un¬ 
countable. Classical methods such as matrix exponentiation are infeasible for large or countably 
infinite state spaces, and sampling-based alternatives are computationally intensive, requiring a 
large integration step to impute over all possible hidden events. Recent work has successfully 
applied generating function techniques to computing transition probabilities for linear multi¬ 
type branching processes. While these techniques often require significantly fewer computations 
than matrix exponentiation, they also become prohibitive in applications with large popula¬ 
tions. We propose a compressed sensing framework that signihcantly accelerates the generating 
function method, decreasing computational cost up to a logarithmic factor by only assuming the 
probability mass of transitions is sparse. We demonstrate accurate and efficient transition prob¬ 
ability computations in branching process models for hematopoiesis and transposable element 
evolution. 


1 Introduction 


Continuous-time branching processes are widely used in stochastic modeling of population dynam¬ 
ics, with applications including biology, genetics, epidemiology, quantum optics, and nuclear hssion 
[Renshaw 2011 . With the exception of the well-studied class of birth-death processes, which have 


known expressions for many quantities relevant to probabilistic inference [Crawford et al. 2014 


branching processes pose significant inferential challenges. In particular, closed forms for hnite-time 
transition probabilities, the conditional probability that a trajectory ends at a given state, given a 
starting state and time interval, are unavailable. These transition probabilities are crucial in many 
inferential approaches, comprising the observed likelihood function when data from the process are 
available at a set of discrete times. The likelihood function is of central importance in frequentist 
and Bayesian methods, and any statistical framework involving observed data likelihood evaluation 
requires transition probability computations. Without the ability to fully leverage the branching 
structure, studies must rely on general CTMC estimation techniques or model approximations 


Rosenberg et al. 2003, Golinelli et al., 2006, El-Hay et al., 2006 


Computation of transition probabilities is the usual bottleneck in model-based inference us¬ 
ing CTMCs [Hajiaghayi et ah 2014 , requiring a marginalization over the infinite set of possible 
end-point conditioned paths. Classically, this marginalization is accomplished by computing the 
matrix exponential of the inhnitesimal generator of the CTMC. However, this procedure has cubic 
runtime complexity in the size of the state space, becoming prohibitive even for state spaces of 
moderate sizes. Alternatives also have their shortcomings: uniformization methods use a discrete¬ 
time “skeleton” chain to approximate the CTMC but rely on a restrictive assumption that there is 
a uniform bound on all rates [Grassmann , 1977| Rao and Teh[ 2011] . Typically, practitioners resort 
to sampling-based approaches via Markov chain Monte Carlo (MCMC). Specihcally, particle-based 
methods such as sequential Monte Carlo (SMC) and particle MCMC [Doucet et al. 2000, Andrieu 
et al., 2010 offer a complementary approach whose runtime depends on the number of imputed 
transitions rather than the size of the state space. However, these SMC methods have several 
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limitations— in many applications, a prohibitively large number of particles is required to impute 
waiting times and events between transitions, and degeneracy issues are a common occurrence, 

(MI 


especially in longer time series. A method by Hajiaghayi et al. 


accelerates particle-based 
methods by marginalizing holding times analytically, but has cubic runtime complexity in the num¬ 
ber of imputed jumps between observations and is recommended for applications with fewer than 
one thousand events occurring between observations. 

Recent work by Xu et al. 2014| has extended techniques for computing transition probabilities in 
birth-death models to linear multi-type branching processes. This approach involves expanding the 
probability generating function (PGF) of the process as a Fourier series, and applying a Riemann 
sum approximation to its inversion formula. This technique has been used to compute numerical 
transition probabilities within a maximum likelihood estimation (MLE) framework, and has also 
been applied within Expectation Maximization (EM) algorithms [Doss et al. 2013, Xu et aHf 
2014 . While this method provides a powerful alternative to simulation and avoids costly matrix 


operations, the Riemann approximation to the Fourier inversion formula requires 0{N^) PGF 
evaluations, where b is the number of particle types and N is the largest population size at endpoints 
of desired transition probabilities. This complexity is no worse than linear in the size of the state 
space, but can also be restrictive: a two-type process in which each population can take values in 
the thousands would require millions of PGF evaluations to produce transition probabilities over an 
observation interval. This can amount to hours of computation in standard computing architectures, 
because evaluating PGFs for multitype branching processes involves numerically solving systems 
of ordinary differential equations (ODEs). Such computations become infeasible within iterative 
algorithms. 

In this paper, we focus our attention on the efficient computation of transition probabilities 
in the presence of sparsity, presenting a novel compressed sensing generating function (GSGF) 
algorithm that dramatically reduces the computational cost of inverting the PGE. We apply our 
algorithm to a branching process model used to study hematopoiesis as well as a birth-death-shift 
process with applications to molecular epidemiology, and see that the sparsity assumption is valid 
for scientifically realistic rates of the processes obtained in previous statistical studies. We compare 
performance of GSGF to transition probability computations without taking advantage of sparsity, 
demonstrating a high degree of accuracy while achieving significant improvements in runtime. 


2 Markov Branching Processes 

A linear multitype branching process follows a population of independently acting particles that 
reproduce and die. The random vector X(t) takes values in a discrete state space D at time t, 
with Xi{t) denoting the number of type i particles present at time t. For exposition and notational 
simplicity, we will focus on the two-type case. In the continuous-time setting, each type i particle 
produces k type 1 particles and I type 2 particles with instantaneous rates aj{k,l), and the rates 
of no event occurring are defined as 


ai := ai(l,0) = - ^ ai{k,l), 

02 := 02 ( 0 , 1 ) = - ^ a2{k,l) 

{k,1)^(0,1) 

so that = 0 for z = 1,2. Offspring of each particle evolve according to the same set 

of instantaneous rates, and these rates aj{k,l) do not depend on t so that the process is time- 
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homogeneous. These assumptions imply that each type i particle has exponentially distributed 
lifespan with rate —a*, and X(t) evolves over time as a CTMC [Guttorp 1995|. 


2.1 Transition probabilities 

Dynamics of a CTMC are determined by its transition function 

fx,y(t) = Pr(X(t + s) = y|X(s) = x), (1) 

where time-homogeneity implies independence of the value of s on the right hand side. When the 
state space D is small, one can exponentiate the |D| by |D| infinitesimal generator or rate matrix 
Q = {9x,y }x ygQ) where the entries gx,y denote the instantaneous rates of jumping from state x to 
y, to compute transition probabilities: 


= = (2) 

k=0 

These transition probabilities are fundamental quantities in statistical inference for data generated 
from CTMCs. For instance, if X(t) is observed at times ti,... ,tj and D represents the 2 by J 
matrix containing the observed data, the observed log-likelihood is given by 


J-i 

4(D; 0) = Y^ logpx{q),X(q+i)(ij-hi “ tj; 0) (3) 

i=i 

where the vector 6 parametrizes the rates aj{k, 1). Maximum likelihood inference that seeks to find 
the value 6 that optimizes Q as well as Bayesian methods where likelihood calculations arise in 
working with the posterior density (up to a proportionality constant) fundamentally rely on the 
ability to calculate transition probabilities. Having established their importance in probabilistic 
inference, we focus our discussion in this paper to computing these transition probabilities in a 
continuous-time branching process. 


2.2 Generating function methods 

Matrix exponentiation is cubic in |D| and thus prohibitive in many applications, but we may take 
an alternate approach by exploiting properties of the branching process. Xu et ah [2014 extend a 
generating function technique used to compute transition probabilities in birth-death processes to 
the multi-type branching process setting. The probability generating function (PGF) for a two-type 
process is defined 

= =j,X2(0) = A:) 

oo oo 

= EE P{jk ),) 


( 4 ) 


1=0 m =0 


this definition extends analogously for any m-type process. We suppress dependence on 6 for no- 


tational convenience. Bailey 1964 provides a general technique to derive a system of differential 


equations governing (jijk using the Kolmogorov forward or backward equations given the instan¬ 
taneous rates aj{k,l). It is often possible to solve these systems analytically for (pjk, and even 
when closed forms are unavailable, numerical solutions can be efficiently obtained using standard 
algorithms such as Runge-Kutta methods Butcher, 1987]. 
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With (pji^ available, transition probabilities are related to the PGF @ via differentiation: 

Ql Qm 

P{jk),(lm){t) = 

O'*! 0*2 si=S2=0 

This repeated differentiation is computationally intensive and numerically unstable for large l,m, 
but following [Lange [1982 , we can map the domain si, S 2 £ [0,1] x [0,1] to the boundary of the 
complex unit circle, instead setting si = , S 2 = generating function becomes a 

Fourier series whose coefficients are the desired transition probabilities 


l,m=0 

Applying a Riemann sum approximation to the Fourier inversion formula, we can now compute the 
transition probabilities via integration instead of differentiation: 

r-l /■! 





0 JO 
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( 6 ) 


-^=0 11=0 


^ —27Tilu/N—27TimvlN 
X c c 


In practice, the set of transition probabilities S = {p(jk),{im){t)} for alH, m = 0,..., A^, given 
initial values of {j, k), can be obtained via the Fast Fourier Transform (FFT), described in Section 
1^ It is necessary to choose N > l,m, since exponentiating the roots of unity can yield at most N 
distinct values 

^—27TimvlN _ ^—27ri{mv modN)/N. 


this is related to the Shannon-Nyquist criterion [Shannon 2001 , which dictates that the number 
of samples required to recover a signal must match its highest frequency. Thus, calculating “high 
frequency” coefficients— when l,m take large values—requires 0{N‘^) numerical ODE solutions, 
which becomes computationally expensive for large N. 


Sparsity: Given an initial state X(0) = {j,k), the support of transition probabilities is often 
concentrated over a small range of {l,m) values. For example, if X(f) = (800,800), then the 
probability that the entire process becomes extinct, X(t + s) = (0,0), is effectively zero unless 
particle death rates are very high or s is a very long time interval. In many realistic applications, 
P{ 800 , 800 ),{i,m){s) has non-negligible mass on a small support, for instance only over l,m values 
between 770 and 820. While their values can be computed using Equation Q for a choice of 
N > 820, requiring iV^ ODE evaluations toward computing only (820 — 770)^ nonzero probabilities 
seems wasteful. To exploit the sparsity information in such a setting, we bridge aforementioned 
branching process techniques to the literature of compressed sensing. 


3 Compressed Sensing 

Originally developed in an information theoretic setting, the principle of compressed sensing (CS) 
states that an unknown sparse signal can be recovered accurately and often perfectly from signif¬ 
icantly fewer samples than dictated by the Shannon-Nyquist rate at the cost of solving a convex 
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optimization problem [Donoho 2006, Candes, 2006 . CS is a robust tool to collect high-dimensional 


sparse data from a low-dimensional set of measurements and has been applied to a plethora of helds, 
leading to dramatic reductions in the necessary number of measurements, samples, or computa¬ 
tions. In our setting, the transition probabilities play the role of a target sparse signal of Fourier 
coefficients. The data reduction made possible via CS then translates to reducing computations to 
a random subsample of PGF evaluations, which play the role of measurements used to recover the 
signal. 


3.1 Overview 

In the CS framework, the unknown signal is a vector x G observed through a measurement 
b = Vx G with M << N. Here V denotes an M x N measurement matrix or sensing 

matrix. Since M < N, the system is underdetermined and inversion is highly ill-posed—the space 
of solutions is an infinite affine subspace, but CS theory shows that recovery can be accomplished 
under certain assumptions by seeking the sparsest solution. Let -0 be an orthonormal basis of 
that allows a iL-sparse representation of x: that is, x = 0s where s is a sparse vector of coefficients 
Candes 2006[ proves that recovery can then be accurately accomplished by 


such that 


< K. 


hnding the sparsest solution 


s = argmm | 

S 


s||o 


s.t. As = b 


(7) 


where A = V0 is the composition of the measurement and sparsifying matrices. In practice, this 
non-convex objective is combinatorially intractable to solve exactly, and is instead solved by proxy 
via ^i-relaxation, resulting in a convex optimization program. In place of Equation 0 , we optimize 
the unconstrained penalized objective 


1,1 1,2 

s = argmin - || As — b ||2 . 

s 2 


0 1 


( 8 ) 


where A is a regularization parameter enforcing sparsity of s. The signal x, or equivalently s, can be 
recovered perfectly using only M = CKlogN measurements for some constant C when A satisfies 
the Restricted Isometry Property (RIP) [Candes and Tao 2005, Candes, 2008 —briefly, this requires 


that V and 0 to be incoherent so that rows of V cannot sparsely represent the columns of 0 and 
vice versa. Coherence between V, 0 is defined as 


Mv,0) = 


nmax l(V,00|, 


and low coherence pairs are desirable. It has been shown that choosing random measurements V 
satisfies RIP with overwhelming probability Candes, 2008 . Further, given 0, it is often possible 


to choose a known ideal distribution from which to sample elements in V such that V and 0 are 
maximally incoherent. 


3.2 Higher dimensions 


CS theory extends naturally to higher-dimensional signals [Candes 2006 . In the 2D case which 
will arise in our applications (Section]^, the sparse solution S G and measurement 


B = ASA^ G 


-"MxM 


are matrices rather than vectors, and we solve 

1 , 


S = argmin-|| AS A"' — B|| 2 -|-A||S||i. 
s 2 


(9) 


( 10 ) 
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This can always be equivalently represented in the vector-valued framework: vectorizing 


vec(S) = s G C 


Ar2 


vec(B) = b G 


we now seek b = y4s as in Equations Q , Q , where A = A (8) A is the Kronecker product of A with 
itself. In practice, it can be preferable to solve (10), since the number of entries in A grows rapidly 


and thus the vectorized problem requires a costly construction of A and can be cumbersome in 
terms of memory. 


4 CSGF Method 

We propose an algorithm that allows for efficient PGF inversion within a compressed sensing 
framework. We focus our exposition on two-type models: linear complexity in |0| is less often 
a bottleneck in single-type problems, and all generating function methods as well as compressed 
sensing techniques we describe extend to higher dimensional settings. 

We wish to compute the transition probabilities Pjk,imi't) given any t > 0 and X(0) = {j,k). 
These probabilities can be arranged in a matrix S G with entries 

Without the CS framework, these probabilities are obtained following Equation ([^ by first com¬ 
puting an equally sized matrix of PGE solutions 


B = 




■^NxN 


( 11 ) 


For large N, obtaining B is computationally expensive, and our method seeks to bypass this 

step. When B is computed, transition probabilities are then recovered by taking the fast Fourier 

transform S = fft(B). To better understand how this fits into the CS framework, we can equivalently 

~ T 

write the fast Fourier transform in terms of matrix operations S = FBF , where F G denotes 

the discrete Fourier transform matrix (see Supplement). Thus, the sparsifying basis i/? is the Inverse 
Discrete Fourier Transform (IDFT) matrix = F* given by the conjugate transpose of F, and we 
have B = . 

When the solution matrix S is expected to have a sparse representation, our CSGF method seeks 
to recover S without computing the full matrix B, instead beginning with a much smaller set of 
PGF evaluations B G corresponding to random entries of B selected uniformly at random. 

Denoting randomly sampled indices I, this smaller matrix is a projection B = ASA^ in the form 
of Equation ([^ where A G is obtained by selecting a subset of rows of -0 corresponding to 

X. Uniform sampling of rows corresponds to multiplying by a measurement matrix encoding the 
spike basis (or standard basis): formally, this fits into the framework described in Section 3.1 


as 


A = V0, with measurement matrix rows Vj(Z) = 5{j — 1). The spike and Fourier bases are known 
to be maximally incoherent in any dimension, so uniformly sampling indices X is optimal in our 
setting. 

Now in the compressed sensing framework, computing the reduced matrix B only requires a 
logarithmic proportion |B| oc A log |B| of PGF evaluations necessary in Equation ©• Computing 
transition probabilities in S is thus reduced to a signal recovery problem, solved by optimizing the 


objective in Equation (10) 
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4.1 Solving the problem 


There has been extensive research on algorithms for solving the ii regularization objective in 
Equation ([^ and related problems Tibshirani, 1996, Beck and Teboulle, 2009a| . As mentioned 
previously, vectorizing the problem so that it can be represented in the form (|^ requires wasteful 
extra memory; instead we choose to solve the objective in Equation (10) using a proximal gradient 
descent (PGD) algorithm. 

PGD is useful for solving minimization problems with objective of the form f{x) = g{x) + h{x) 
with g convex and differentiable, and h convex but not necessarily differentiable. Letting 


5(S) = ^||ASA^-B| 


|2 

l2) 


/i(S) = A||S| 


1) 


we see that Equation ( |10| ) satisfies these conditions. A form of generalized gradient descent, PGD 
iterates toward a solution with 


Xk+i = argmin[g'(a:fc) + Vg{xky{z - Xk) 

Z 

+ ~ ^k\\\ + h{z)\, 

zLu 


( 12 ) 


where Lk is a step size that is either fixed or determined via line-search. This minimization has 
known closed-form solution 


Xk+i = softh(xfc - LkVg{xk),LkX), 
where softh is the soft-thresholding operator 

[softh(x, a)\i = sgn(xi) max(|xi| — a, 0). 


(13) 


(14) 


Alternating between these steps results in an iterative soft-thresholding algorithm that solves the 


convex problem (10) with rate of convergence 0{l/k) when Lk is fixed. The softh() operation is 


simple and computationally negligible, so that the main computational cost is in evaluating \tg{xk)- 
We derive a closed form expression for the gradient in our setting 


VgiS) = -A*(B - ASA^)A, 


(15) 


where A, A* denote complex conjugate and conjugate transpose of A respectively. In practice, the 
inner term ASA^ is obtained as a subset of the inverse fast Fourier transform of S rather than 
by explicit matrix multiplication. The computational effort in computing 'Vg{S) therefore involves 
only the two outer matrix multiplications. 

We implement a fast variant of PGD using momentum terms [Beck and Teboiille 2009b based 
on an algorithm introduced by Nesterov, and select step sizes Lk via a simple line-search subroutine 
[Beck and Teboiille 2009a| . The accelerated version includes an extrapolation step, where the soft- 
thresholding operator is applied to a momentum term 

Vk+l ^k T ^ki^k ^k—l) 

rather than to Xk] here ojk is an extrapolation parameter for the momentum term. Remarkably, 
the accelerated method still only requires one gradient evaluation at each step as Vk+i is a simple 
linear combination of previously computed points, and has been proven to achieve the optimal 
worst-case rate of convergence 0(1/A:^) among first order methods Nesterov, 1983 . Similarly, the 


line-search procedure involves evaluating a bound that also only requires one evaluation of Vgf (see 
Supplement). 

Algorithm provides a summary of the CSGF method in pseudocode. 
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Algorithm 1 CSGF algorithm. 


1 : 

2 : 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


Input: initial sizes Xi = j,X 2 = k, time interval t, branching rates 6 , signal size N > j,k, 
measurement size M, penalization constant A > 0, line-search parameters L, c. 

Uniformly sample M indices I C [0,... A — 1] /A 
Compute B = { 

Define A = i/jj. the X rows of IDFT matrix xj) 

Initialize: Si = Yi = 0 
for A; = 1, 2,..., {max iterations} do 
Choose Lk = line-search(L, c, Yfc) 

Update extrapolation parameter Uk = -j^ 

Update momentum Y^+i = -|- oJki^k — ^k-i) 

Compute Vg(Yk+i) according to (15) 

Update Sfc+i = softh(Sfc - LkVg(Yk+i), LkX) 
end for 

return S = Sfc+i 


5 Examples 

We will examine the performance of CSGF in two applications: a stochastic two-compartment 
model used in statistical studies of hematopoiesis, the process of blood cell production, and a 
birth-death-shift model that has been used to study the evolution of transposons, mobile genetic 
elements. 


5.1 Two-compartment hematopoiesis model 

Hematopoiesis is the process in which self-sustaining primitive hematopoietic stem cells (HSCs) 
specialize, or differentiate, into progenitor cells, which further specialize to eventually produce 
mature blood cells. In addition to far-reaching clinical implications — stem cell transplantation is 
a mainstay of cancer therapy — understanding hematopoietic dynamics is biologically interesting, 
and provides critical insights of general relevance to other areas of stem cell biology [Orkin and 
Zon, 2008| . The stochastic model, depicted in Figure [T| has enabled estimation of hematopoietic 


2009 


rates in mammals from data in several studies Catlin et al., 2001 Golinelli et al., 2006 Fong et al. 


Without the ability to compute transition probabilities, an estimating equation approach by 


Catlin et al. 2001| is statistically inefficient, resulting in uncertain estimated parameters with very 
wide conhdence intervals. Nonetheless, biologically sensible rates are inferred. Golinelli et al. [2006j 


observe that transition probabilities are unknown for a linear birth-death process (compartment 
1) coupled with an inhomogeneous immigration-death process (compartment 2), motivating their 
computationally intensive reversible jump MCMG implementation. 

However, we can equivalently view the model as a two-type branching process. Under such a 
representation, it becomes possible to compute transition probabilities via Equation ([^. The type 
one particle population Ai corresponds to hematopoietic stem cells (HSGs), and X 2 represents 
progenitor cells. With parameters as denoted in Figure the nonzero instantaneous rates dehning 
the process are 

ai(0, l) = i/ ai{l,0) = -{p + u) 

02 ( 0 ,1) = -pL. 


ai(2,0) = p 

02 ( 0 , 0 ) = p. 


ai(l,0) = -{p + i^) 


(16) 
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Figure 1: HSCs can self-renew, 
producing new HSCs at rate p, or 
differentiate into progenitor cells 
at rate u. Further progenitor dif¬ 
ferentiation is modeled by rate p. 


Having specified the two-type branching process, we derive solutions for its PGF, defined in 
Equation Q, with details in the Supplement: 


Proposition 5.1 The generating function for the two-type model described in (16) is given by 
^jk = where 


si, S 2 ) = 1 -h (s2 - 

^ ^t(l^i,o{t^si,S2) = p(j)ifi{t,si,S2) - ip-^iy)(l)i,o{t,si,S2) (17) 

+l^4>0,l{t,Si,S2). 

We see that (pop has closed form solution so that evaluating (pj^ only requires solving one ODE 
numerically, and with the ability to compute (pjk, we may obtain transition probabilities using 
Equation ([^. In this application, cell populations can easily reach thousands, motivating the 
CSGF approach to accelerate transition probability computations. 


5.2 Birth-death-shift model for transposons 


Our second application examines the birth-death-shift (BDS) process proposed by Rosenberg et al. 
[2003 to model evolutionary dynamics of transposable elements or transposons, genomic mobile 


sequence elements. Each transposon can (1) duplicate, with the new copy moving to a new genomic 
location; (2) shift to a different location; or (3) be removed and lost from the genome, independently 
of all other transposons. These respective birth, shift, and death events occur at per-particle 
instantaneous rates (3, a, 6, with overall rates proportional to the total number of transposons. 
Transposons thus evolve according to a linear birth-death-shift Markov process in continuous time. 
In practice, genotyping technologies allow for this process to be discretely monitored, necessitating 
computation of finite-time transition probabilities. 

2003j estimate evolutionary rates of the IS hi id transposon in the Mycobac- 


Rosenberg et al. 


terium tuberculosis genome from a San Francisco community study dataset Cattamanchi et al. 


2006 . Without transition probabilities, the authors maximize an approximate likelihood by assum¬ 


ing at most one event occurs per observation interval, a rigid assumption that severely limits the 


range of applications. Doss et al 


revisit their application, inferring similar rates of IS hi Id 


evolution using a one-dimensional birth-death model that ignores shift events. Xu et al. [2014 show 
that the BDS model over any finite observation interval can be modeled as a two-type branching 
process, where Xi denotes the number of initially occupied genomic locations and X 2 denotes the 
number of newly occupied locations (see figure in Supplement). In this representation, full dy¬ 
namics of the BDS model can be captured, and generating function techniques admit transition 
probabilities, leading to rate estimation via MLE and EM algorithms. Transposon counts in the 
tuberculosis dataset are low, so that Equation (|^ can be computed easily, but their method does 
not scale well to applications with high counts in the data. 
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k, number of progenitors 


Figure 2: Illustrative example of recovered transition probabilities in hematopoiesis model described 
in Section Beginning with 15 HSCs and 5 progenitors over a time period of one week, the CSGF 
solution S = j,k = 0, perfectly recovers transition probabilities S, using 

fewer than half the measurements. 


The nonzero rates defining the two-type branching process representation of the BDS model 
are given by 


ai(l,l) = 13, 
ai(l,0) = -{/3 -I- fT + 5), 
02(0,1) = -{P + 6), 


ai(0,1) = a, 

a2(0,2) = /3, 

02(0, 0 ) = 6 . 


ai(0,0) = 6 , 


and its PGF is governed by the following system derived in |Xu et alT| |2014| 


1 -1 


4>o,i{t, si, S 2 ) — 1 + 
ji^i,o{t, Si, S 2 ) = /3(/>i,o</>2 + o-po,! + 6 - {P + a + 5)si, 


(18) 


(19) 


again with <pjk = Pi q(/>o 1 by particle independence. 


5.3 Results 


To compare the performance of GSGF to the computation of Equation Q without considering 
sparsity, we first compute sets of transition probabilities S of the hematopoiesis model nsing the 


full set of PGF solution measurements B as described in Equation (11). These “true signals 
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are compared to the signals computed using CSGF S, recovered using only a random subset of 
measurements B following Algorithmic Figureprovides an illustrative example with small cell 
populations for visual clarity— we see that the support of transition probabilities is concentrated 
(sparse), and the set of recovered probabilities S is visually identical to the true signal. 

In each of the aforementioned applications, we calculate transition probabilities S G for 

maximum populations N = 2^, 2®,... 2^^, given rate parameters 6 , initial population X(0), and 
time intervals t. Each computation of S requires N'^ numerical evaluations of the ODE systems 
([iC), Eor each value of iV, we repeat this procedure beginning with ten randomly chosen sets 
of initial populations X(0) each with total size less than N. We compare the recovered signals S 
computed using CSGE to true signals S, and report median runtimes and measures of accuracy 
over the ten trials, with details in the following sections. 




Eigure 3: Randomly selected probabilities and largest probabilities recovered using GSGE are 
nearly identical to their true values. Probabilities displayed here correspond to a randomly selected 
BDS model trial with N=512; transition probabilities S via GSGE are recovered from a sample B 
requiring fewer than 2% of ODE computations used to compute S = fft(B). 


Parameter settings: In the hematopoiesis example, we set per-week rates 0hema = (0.125,0.104,0.147) 
and observation time t = 1 week based on biologically sensible rates and observation time scales 
of data from previous studies of hematopoiesis in mammals [Gatlin et al. 2001, Golinelli et al., 2006| 

Eor the BDS application, we set per-year event rates 0bds = (0.0156,0.00426,0.0187) 
and t = .35 years, the average length between observations in the 


Eong et al., 2009 


estimated in Xu et al., 2014 


San Erancisco tuberculosis dataset [Gattamanchi et 2006 


In each case, we computed = 3K log N'^ total random measurements to obtain B for GSGE, 
and we set the regularization parameters Ahsc = V^og^, Abds = logM, with more regularization 
in the BDS application as lower rates and a shorter observation interval leads us to expect more 
sparsity. While careful case-by-case tuning to choose A, M would lead to optimal results, we set 
them in this simple manner across all trials to demonstrate a degree of robustness, still yielding 
promising performance results. In practice one may apply standard cross-validation procedures to 
select A, M, and because the target solution is a set of transition probabilities, checking that entries 
in the recovered solution S sum close to 1 offers a simpler available heuristic. Einally, though one 
may expedite convergence of PGD by supplying an informed initial guess with positive values near 
values X(0) in practice, we initialize PGD with an uninformative initial value Si = 0 in all cases. 
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Table 1: Runtimes and error, birth-death-shift model. 


N 

M 

Time (sec), 

B e 

Time (sec), 
B e c^xM 

Time (sec), 
PGD 

^max — 

\Pij,kl Pij,kl\msix 

^rel — 

^msix/\Pij,kl\ma,x 

128 

25 

39.7 

2.3 

1.0 

5.27 X 10"^ 

2.77 X 10-^ 

256 

33 

150.2 

3.8 

7.8 

4.86 X 10-3 

4.71 X 10-2 

512 

45 

895.8 

7.8 

25.3 

2.71 X 10-3 

4.68 X 10-2 

1024 

68 

2508.9 

18.6 

58.2 

1.41 X 10-3 

5.12 X 10-2 

2048 

101 

9788.3 

26.1 

528.3 

8.10 X 10-^ 

4.81 X 10-2 

4096 

150 

40732.7 

57.4 

2234.7 

4.01 X 10-'^ 

5.32 X 10-2 


Table 2: Runtimes and error, hematopoiesis model 


N 

M 

Time (sec), 
B E C^xTV 

Time (sec), 
B e c^xM 

Time (sec), 
PGD 

^max — 

\Pij,kl Pij,kl\max 

^rel — 

^max /1 Pij, fc/1 max 

128 

43 

108.6 

9.3 

0.64 

9.41 X 10-^ 

2.25 X 10-2 

256 

65 

368.9 

22.1 

2.1 

9.44 X 10-^ 

4.73 X 10-2 

512 

99 

922.1 

44.8 

8.5 

3.23 X 10-^ 

3.60 X 10-2 

1024 

147 

5740.1 

118.1 

41.9 

2.27 X 10-^ 

5.01 X 10-2 

2048 

217 

12754.8 

145.0 

390.0 

1.29 X 10-^ 

5.10 X 10-2 

4096 

322 

58797.3 

310.7 

2920.3 

9.43 X 10-3 

6.13 X 10-2 


Accuracy: In both models and for all values of N, each signal was reconstructed very accurately. 
Errors are reported in Tablesandfor the BDS and hematopoiesis models respectively. Maximum 
absolute errors for each CSGF recovery 

£max = max|{S}M - {S};;,; I = max|%fc;(t) - Pij,ki{t)\ 

kl kl 

are on the order of 10“^ at worst. We also report a measure of relative error, and because £ max 
is typically attained at large probabilities, we include the maximum absolute error relative to the 
largest transition probability 

^max 

providing a more conservative measure of accuracy. We still see that Grei is on the order of 10“^ in 
all cases. Visually, the accuracy of CSGF is stark: Figure [^provides a side-by-side comparison of 
randomly selected transition probabilities recovered in the BDS model for N = 2^. 


Running Times: Tablesandshow dramatic improvements in runtime using CSGF, reducing 
the number of ODE computations logarithmically. For instance, with N = 4096, we see the time 
spent on PGF evaluations necessary for CSGF is less than 0.1% of the time required to compute S in 
the BDS model, and around 0.5% of computational cost in the less sparse hematopoiesis application. 
Including the time required for solving Equation (10) via PGD, we see that computing S using GSGF 
reduces runtime by two orders of magnitude, requiring less than 6% of total computational time 
spent toward computing S in the worst case. We remark that ODE solutions are computed using a C 
implementation of efficient solvers via package deSolve, while we employ a naive R implementation 
of PGD. We emphasize the logarithmic reduction in required numerical ODE solutions; an optimized 
implementation of PGD reducing R overhead will yield further real-time efficiency gains. 
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6 Discussion 


We have presented a novel adaptation of recent generating function techniques to compute branch¬ 
ing process transition probabilities within the compressed sensing paradigm. While generating 
function approaches bypass costly matrix exponentiation and simulation-based techniques by ex¬ 
ploiting mathematical properties in the branching structure, our contribution now makes these 
techniques scalable by additionally harnessing the available sparsity structure. We show that when 
sparsity is present in the set of transition probabilities, computational cost can be reduced up to a 
logarithmic factor over existing methods. Note that sparsity is the only additional assumption nec¬ 
essary to apply our CSGF method—no prior knowledge about where transition probabilities have 
support is necessary. Many real-world applications of branching process modeling feature such 
sparsity, and we have seen that CSGF achieves accurate results with significant efficiency gains in 
two such examples with realistic parameter settings from the scientific literature. Transition prob¬ 
abilities are often important, interpretable quantities in their own right, and are necessary within 
any likelihood-based probabilistic framework for partially observed CTMCs. Their tractability us¬ 
ing CSGF opens doors to applying many Bayesian and frequentist tools to settings in which such 
methods were previously infeasible. Finally, we note that other statistically relevant quantities 
such as expectations of particle dwell times and restricted moments can be computed using sim¬ 
ilar generating function techniques Minin and Suchard 2008 , and the CSGF framework applies 
analogously when sparsity is present. 
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Supplement 


Discrete Fourier matrix 


The N hy N discrete Fourier transform matrix Ftv has entries 


{Fjvlyfc = 


1 

v/F 




with j. A: = 0,1,..., — 1 and u = , and as we mention in the main paper, the inverse Fourier 

transform matrix is given by its conjugate transpose. The partial M by IDFT matrices A 
necessary in Algorithm 1 is obtained by only computing and stacking a subset of M random rows 
from r/>. 


Line search subroutine 

We select step sizes with a simple line search algorithm summarized in the pseudocode below that 
works by evaluating an easily computed upper bound / on the objective /: 

fUZ, Y) := fiY) + Vf{Yf{Z -Y) + ^\\Z- y||i. (A-1) 

We follow Beck and Teboulle [2009], who provide further details. In implementation, we select 
L = .000005 and c = .5, and reuse the gradient computed in line-search for step 10 of Algorithm 
1 in the main paper. 

Derivation for hematopoiesis process PGF 

Given a two-type branching process defined by instantaneous rates ai{k,l), denote the following 
pseudo-generating functions for A = 1,2: 

Ui{si,S2) = 

k i 
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Algorithm 2 line-search procedure. 

1 : Input: initial step size L, shrinking factor c, matrices Yk-,^giY^). 
2 : Set Z = softh(yfc - LVg{Yk)) 

3: while g{Z) > fL{Z,Yk) do 
4: Update L = cL 

5: end while 
6 : return Lf^ = L 


We may expand the probability generating functions in the following form; 

= = 1,A2(0) = 0) 

oo oo 

k=0 1=0 

oo OO 

= ^ ^(lfc=i,«=o + ai(k, l)t + o(t))si 4 

k=0 1=0 

= Si + ttl(si, S2)t + o(t). 

Of course we have an analogous expression for si, S 2 ) beginning with one particle of type 

2 instead of type 1. For short, we will write (/>io ■= (/>i, (/>oi '= 4>2- 

Thus we have the following relation between the functions (j) and u: 

^{t,Sl,S2)\t=0 = Ui{si,S 2 ) 

^it,Sl,S2)\t=0 = U2{si,S2) 

To derive the backwards and forward equations, Chapman-Kolmogorov arguments yield the 
symmetric relations 

(J)l{t + h,Sl,S2) = (l)l{t,(pl{h,Sl,S2),4>2ih,Sl,S2)) (A-2) 

= si, S 2 ), </> 2 (t, si, S 2 )) (A-3) 

First, we derive the backward equations by expanding around t and applying (2): 

(pi{t + h,si,S2) = S1,S2) + + h, Si, S2)\h=oh + o{h) 

= 4 >l{t,Si,S 2 ) + Si, S2),(j)2it, Si, S2)\h=oh + o{h) 

= (j)l{t. Si, S2) + Ul{(j)l{t, Si, S2), (p2{t, Si, S2)h + o{h)) 

Since an analogous argument applies for (1)2, we arrive at the system 

f ji<f)i{t, Sl, S2) = Ul{<j)l{t, Sl, S2), (j)2{t, Sl, S2)) 

1 Sl, S2) = U2(^l(t, Sl, S2), (j)2{t, Sl, S2)) 


with initial conditions (/>i( 0 , si, S2) = si, (j)2{0, si, S2) = S2- 
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Recall the rates defining the two-compartment hematopoiesis model are given by 


ai(2,0) = p ai(0, l) = z^ ai(l, 0) =-(/>z/) 

02(0,0) = ^ 02(0,1) = -^ 


Thus, the pseudo-generating functions are 


wi(si, S2) = ps\ -h 1^82 -{p + v)si 

U 2 {si, 82 ) = p- P 82 = p{l - 82 ) 

Plugging into the backward equations, we obtain 

^01 (t, si, S 2 ) = P4>i{t, si, S 2 ) + si, S 2 ) -{p + Sl, S 2 ) 

and ^ 

— (/) 2 (t. Si, 82 ) = p- p 4 > 2 {t, Si, S 2 ). 

The (/>2 differential equation corresponds to a pure death process and is immediately solvable: 
suppressing the arguments of < 1)2 for notational convenience, we obtain 


Pluggin in (^> 2 ( 0 , si, S 2 ) 


—4)2 = p- p4)2 


—A 1 ^ 

- 4)2 


) = P 


ln(l - 4>2) = -pt + C 

02 = 1 - ex.p{-pt + C) 


82 , we obtain C = ln(l — S 2 ), and we arrive at 


4>2{t, si, S 2 ) = 1 (s 2 - 1) exp(-;ut) (A-4) 

Plugging this solution into the other backward equation, we obtain 

^01 (f, Si, S 2 ) = /30?(0 Si, S 2 ) - {p + 1^)01 (0 Si, S 2 ) Z^(l (S 2 “ 1) exp(-^f)) (A-5) 

This ordinary differential equation can be solved numerically given rates and values for the 
three arguments, allowing computation of 4>i,j = 0 i 02 which holds by particle independence. 


BDS model diagram 

The branching process components X(f) = (xoid, Xnew) represent the number of originally occupied 
and newly occupied sites at the end of each observation interval. As an example, assume six 
particles (transposons) are present initially at time to, and a shift and a birth occur before the 
first observation ti, and a death occurs before a second observation at t 2 - When considering 
the first observation interval [to,ti), we have {X(fo) = (6,0),X(fi) = (5,2)}. When computing 
the next transition probability over [ti,t 2 ), we now have {X(fi) = (7,0),X(f2) = (6,0)}, since all 
seven of the particles at ti, now the left endpoint of the observation interval, now become the initial 
population. Even with data over time, this seeming inconsistency at the endpoints does not become 
a problem because transition probability computations occur separately over disjoint observation 
intervals. See Xu et al. [2014] for further details. 
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Figure C-4: Illustration of the three types of transposition—birth, death, shift—along a genome, 
represented by circles [Rosenberg et ah, 2003]. Transposons are depicted by rectangles occupying 
locations along the circles/genomes. On the right set of diagrams, a birth event keeps the number 
of type 1 particles intact and increments the number of type 2 particles by one, a death event 
changes the number of type 1 particles from five to four and keeps the number of type 2 particles 
at zero, and finally a shift event decreases the number of type 1 particles by one and increases the 
number of type 2 particles by one. 


18 



